{
  "nbformat_minor": 0,
  "nbformat": 4,
  "cells": [
    {
      "source": [
        "#### Copyright 2017 Google LLC."
      ],
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "JndnmDMp66FL"
      }
    },
    {
      "source": [
        "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "# https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ],
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "hMqWDc_m6rUC",
        "colab": {
          "autoexec": {
            "wait_interval": 0,
            "startup": false
          }
        },
        "cellView": "both"
      },
      "outputs": [],
      "execution_count": 0
    },
    {
      "source": [
        " # Pr\u00e9sentation rapide de Pandas"
      ],
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "rHLcriKWLRe4"
      }
    },
    {
      "source": [
        "**Objectifs d'apprentissage :**\n",
        "  * Introduction aux structures de donn\u00e9es `DataFrame` et `Series` de la biblioth\u00e8que *Pandas*\n",
        "  * Acc\u00e9der aux donn\u00e9es et les manipuler dans une structure `DataFrame` et `Series`\n",
        "  * Importer des donn\u00e9es d'un fichier CSV dans un `DataFrame` *Pandas*\n",
        "  * R\u00e9indexer un `DataFrame` pour m\u00e9langer les donn\u00e9es"
      ],
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "QvJBqX8_Bctk"
      }
    },
    {
      "source": [
        " [*Pandas*](http://pandas.pydata.org/) est une API d'analyse de donn\u00e9es orient\u00e9e colonnes. C'est un excellent outil pour manipuler et analyser des donn\u00e9es d'entr\u00e9e. Beaucoup de frameworks d'apprentissage automatique acceptent les structures de donn\u00e9es *Pandas* en entr\u00e9e.\n",
        "Il faudrait des pages et des pages pour pr\u00e9senter de mani\u00e8re exhaustive l'API *Pandas*, mais les concepts fondamentaux sont relativement simples. Aussi, nous avons d\u00e9cid\u00e9 de vous les exposer ci-dessous. Pour une description plus compl\u00e8te, vous pouvez consulter le [site de documentation de *Pandas*](http://pandas.pydata.org/pandas-docs/stable/index.html), sur lequel vous trouverez de multiples informations ainsi que de nombreux didacticiels."
      ],
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "TIFJ83ZTBctl"
      }
    },
    {
      "source": [
        " ## Concepts de base\n",
        "\n",
        "La ligne de code suivante permet d'importer l'API *pandas* et d'afficher sa version\u00a0:"
      ],
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "s_JOISVgmn9v"
      }
    },
    {
      "source": [
        "from __future__ import print_function\n\n",
        "import pandas as pd\n",
        "pd.__version__"
      ],
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "aSRYu62xUi3g",
        "colab": {
          "autoexec": {
            "wait_interval": 0,
            "startup": false
          }
        }
      },
      "outputs": [],
      "execution_count": 0
    },
    {
      "source": [
        " On distingue deux\u00a0grandes cat\u00e9gories de structures de donn\u00e9es *Pandas*\u00a0:\n",
        "\n",
        "  * Le **`DataFrame`**, un tableau relationnel de donn\u00e9es, avec des lignes et des colonnes \u00e9tiquet\u00e9es\n",
        "  * La **`Series`**, constitu\u00e9e d'une seule colonne. Un `DataFrame` contient une ou plusieurs `Series`, chacune \u00e9tant \u00e9tiquet\u00e9e.\n",
        "\n",
        "Le DataFrame est une abstraction fr\u00e9quemment utilis\u00e9e pour manipuler des donn\u00e9es. [Spark](https://spark.apache.org/) et [R](https://www.r-project.org/about.html) proposent des impl\u00e9mentations similaires."
      ],
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "daQreKXIUslr"
      }
    },
    {
      "source": [
        " Pour cr\u00e9er une `Series`, vous pouvez notamment cr\u00e9er un objet `Series`. Par exemple\u00a0:"
      ],
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "fjnAk1xcU0yc"
      }
    },
    {
      "source": [
        "pd.Series(['San Francisco', 'San Jose', 'Sacramento'])"
      ],
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "DFZ42Uq7UFDj",
        "colab": {
          "autoexec": {
            "wait_interval": 0,
            "startup": false
          }
        }
      },
      "outputs": [],
      "execution_count": 0
    },
    {
      "source": [
        " Il est possible de cr\u00e9er des objets `DataFrame` en transmettant un `dictionnaire` qui met en correspondance les noms de colonnes (des `cha\u00eenes de caract\u00e8res`) avec leur `Series` respective. Lorsque la longueur de la `Series` ne correspond pas, les valeurs manquantes sont remplac\u00e9es par des valeurs [NA/NaN](http://pandas.pydata.org/pandas-docs/stable/missing_data.html) sp\u00e9ciales. Exemple\u00a0:"
      ],
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "U5ouUp1cU6pC"
      }
    },
    {
      "source": [
        "city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])\n",
        "population = pd.Series([852469, 1015785, 485199])\n",
        "\n",
        "pd.DataFrame({ 'City name': city_names, 'Population': population })"
      ],
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "avgr6GfiUh8t",
        "colab": {
          "autoexec": {
            "wait_interval": 0,
            "startup": false
          }
        }
      },
      "outputs": [],
      "execution_count": 0
    },
    {
      "source": [
        " Le plus souvent, vous chargez un fichier entier dans un `DataFrame`. Dans l'exemple qui suit, le fichier charg\u00e9 contient des donn\u00e9es immobili\u00e8res pour la Californie. Ex\u00e9cutez la cellule suivante pour charger les donn\u00e9es et d\u00e9finir les caract\u00e9ristiques\u00a0:"
      ],
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "oa5wfZT7VHJl"
      }
    },
    {
      "source": [
        "california_housing_dataframe = pd.read_csv(\"https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv\", sep=\",\")\n",
        "california_housing_dataframe.describe()"
      ],
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "av6RYOraVG1V",
        "colab": {
          "autoexec": {
            "wait_interval": 0,
            "startup": false
          }
        }
      },
      "outputs": [],
      "execution_count": 0
    },
    {
      "source": [
        " Dans l'exemple ci-dessus, la m\u00e9thode `DataFrame.describe` permet d'afficher des statistiques int\u00e9ressantes concernant un `DataFrame`. La fonction `DataFrame.head` est \u00e9galement utile pour afficher les premiers enregistrements d'un `DataFrame`\u00a0:"
      ],
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "WrkBjfz5kEQu"
      }
    },
    {
      "source": [
        "california_housing_dataframe.head()"
      ],
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "s3ND3bgOkB5k",
        "colab": {
          "autoexec": {
            "wait_interval": 0,
            "startup": false
          }
        }
      },
      "outputs": [],
      "execution_count": 0
    },
    {
      "source": [
        " Autre fonction puissante de *Pandas*\u00a0: la repr\u00e9sentation graphique. Avec `DataFrame.hist`, par exemple, vous pouvez v\u00e9rifier rapidement la fa\u00e7on dont les valeurs d'une colonne sont distribu\u00e9es\u00a0:"
      ],
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "w9-Es5Y6laGd"
      }
    },
    {
      "source": [
        "california_housing_dataframe.hist('housing_median_age')"
      ],
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "nqndFVXVlbPN",
        "colab": {
          "autoexec": {
            "wait_interval": 0,
            "startup": false
          }
        }
      },
      "outputs": [],
      "execution_count": 0
    },
    {
      "source": [
        " ## Acc\u00e9der aux donn\u00e9es\n",
        "\n",
        "L'acc\u00e8s aux donn\u00e9es d'un `DataFrame` s'effectue au moyen d'op\u00e9rations de liste ou de dictionnaire Python courantes\u00a0:"
      ],
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "XtYZ7114n3b-"
      }
    },
    {
      "source": [
        "cities = pd.DataFrame({ 'City name': city_names, 'Population': population })\n",
        "print(type(cities['City name']))\n",
        "cities['City name']"
      ],
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "_TFm7-looBFF",
        "colab": {
          "autoexec": {
            "wait_interval": 0,
            "startup": false
          }
        }
      },
      "outputs": [],
      "execution_count": 0
    },
    {
      "source": [
        "print(type(cities['City name'][1]))\n",
        "cities['City name'][1]"
      ],
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "V5L6xacLoxyv",
        "colab": {
          "autoexec": {
            "wait_interval": 0,
            "startup": false
          }
        }
      },
      "outputs": [],
      "execution_count": 0
    },
    {
      "source": [
        "print(type(cities[0:2]))\n",
        "cities[0:2]"
      ],
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "gcYX1tBPugZl",
        "colab": {
          "autoexec": {
            "wait_interval": 0,
            "startup": false
          }
        }
      },
      "outputs": [],
      "execution_count": 0
    },
    {
      "source": [
        " *Pandas* propose en outre une API extr\u00eamement riche, avec des fonctions avanc\u00e9es d'[indexation et de s\u00e9lection](http://pandas.pydata.org/pandas-docs/stable/indexing.html), que nous ne pouvons malheureusement pas aborder ici dans le d\u00e9tail."
      ],
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "65g1ZdGVjXsQ"
      }
    },
    {
      "source": [
        " ## Manipuler des donn\u00e9es\n",
        "\n",
        "Il est possible d'effectuer des op\u00e9rations arithm\u00e9tiques de base de Python sur les `Series`. Par exemple\u00a0:"
      ],
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "RM1iaD-ka3Y1"
      }
    },
    {
      "source": [
        "population / 1000."
      ],
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "XWmyCFJ5bOv-",
        "colab": {
          "autoexec": {
            "wait_interval": 0,
            "startup": false
          }
        }
      },
      "outputs": [],
      "execution_count": 0
    },
    {
      "source": [
        " [NumPy](http://www.numpy.org/) est un kit d'outils de calculs scientifiques populaire. Les `Series` *Pandas* peuvent faire office d'arguments pour la plupart des fonctions NumPy\u00a0:"
      ],
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "TQzIVnbnmWGM"
      }
    },
    {
      "source": [
        "import numpy as np\n",
        "\n",
        "np.log(population)"
      ],
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "ko6pLK6JmkYP",
        "colab": {
          "autoexec": {
            "wait_interval": 0,
            "startup": false
          }
        }
      },
      "outputs": [],
      "execution_count": 0
    },
    {
      "source": [
        " La m\u00e9thode `Series.apply` convient pour les transformations \u00e0 une colonne plus complexes. Comme la [fonction `map`](https://docs.python.org/2/library/functions.html#map) de Python, elle accepte en argument une [fonction `lambda`](https://docs.python.org/2/tutorial/controlflow.html#lambda-expressions), appliqu\u00e9e \u00e0 chaque valeur.\n",
        "\n",
        "L'exemple ci-dessous permet de cr\u00e9er une `Series` signalant si la `population` d\u00e9passe ou non un million d'habitants\u00a0:"
      ],
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "xmxFuQmurr6d"
      }
    },
    {
      "source": [
        "population.apply(lambda val: val > 1000000)"
      ],
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "Fc1DvPAbstjI",
        "colab": {
          "autoexec": {
            "wait_interval": 0,
            "startup": false
          }
        }
      },
      "outputs": [],
      "execution_count": 0
    },
    {
      "source": [
        " \n",
        "La modification des `DataFrames` est \u00e9galement tr\u00e8s simple. Ainsi, le code suivant permet d'ajouter deux\u00a0`Series` \u00e0 un `DataFrame` existant\u00a0:"
      ],
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "ZeYYLoV9b9fB"
      }
    },
    {
      "source": [
        "cities['Area square miles'] = pd.Series([46.87, 176.53, 97.92])\n",
        "cities['Population density'] = cities['Population'] / cities['Area square miles']\n",
        "cities"
      ],
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "0gCEX99Hb8LR",
        "colab": {
          "autoexec": {
            "wait_interval": 0,
            "startup": false
          }
        }
      },
      "outputs": [],
      "execution_count": 0
    },
    {
      "source": [
        " ## Exercice n\u00b0\u00a01\n",
        "\n",
        "Modifiez le tableau `cities` en ajoutant une colonne bool\u00e9enne qui prend la valeur True si et seulement si les *deux*\u00a0conditions suivantes sont remplies\u00a0:\n",
        "\n",
        "  * La ville porte le nom d'un saint.\n",
        "  * La ville s'\u00e9tend sur plus de 50\u00a0miles carr\u00e9s.\n",
        "\n",
        "**Remarque**\u00a0: Pour combiner des `Series` bool\u00e9ennes, utilisez des op\u00e9rateurs de bits, pas les op\u00e9rateurs bool\u00e9ens classiques. Par exemple, pour le *ET logique*, utilisez `&` au lieu de `and`.\n",
        "\n",
        "**Astuce**\u00a0: En espagnol, \"San\" signifie \"saint\"."
      ],
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "6qh63m-ayb-c"
      }
    },
    {
      "source": [
        "# Your code here"
      ],
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "zCOn8ftSyddH",
        "colab": {
          "autoexec": {
            "wait_interval": 0,
            "startup": false
          }
        }
      },
      "outputs": [],
      "execution_count": 0
    },
    {
      "source": [
        " ### Solution\n",
        "\n",
        "Cliquez ci-dessous pour afficher la solution."
      ],
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "YHIWvc9Ms-Ll"
      }
    },
    {
      "source": [
        "cities['Is wide and has saint name'] = (cities['Area square miles'] > 50) & cities['City name'].apply(lambda name: name.startswith('San'))\n",
        "cities"
      ],
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "T5OlrqtdtCIb",
        "colab": {
          "autoexec": {
            "wait_interval": 0,
            "startup": false
          }
        }
      },
      "outputs": [],
      "execution_count": 0
    },
    {
      "source": [
        " ## Index\n",
        "Les objets `Series` et `DataFrame` d\u00e9finissent \u00e9galement une propri\u00e9t\u00e9 `index`, qui affecte un identifiant \u00e0 chaque \u00e9l\u00e9ment d'une `Series` ou chaque ligne d'un `DataFrame`. \n",
        "\n",
        "Par d\u00e9faut, *Pandas* affecte les valeurs d'index en suivant l'ordre des donn\u00e9es source. Ces valeurs ne varient pas par la suite\u00a0; elles restent inchang\u00e9es en cas de r\u00e9agencement des donn\u00e9es."
      ],
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "f-xAOJeMiXFB"
      }
    },
    {
      "source": [
        "city_names.index"
      ],
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "2684gsWNinq9",
        "colab": {
          "autoexec": {
            "wait_interval": 0,
            "startup": false
          }
        }
      },
      "outputs": [],
      "execution_count": 0
    },
    {
      "source": [
        "cities.index"
      ],
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "F_qPe2TBjfWd",
        "colab": {
          "autoexec": {
            "wait_interval": 0,
            "startup": false
          }
        }
      },
      "outputs": [],
      "execution_count": 0
    },
    {
      "source": [
        " Appelez `DataFrame.reindex` pour r\u00e9organiser manuellement les lignes. Le code suivant, par exemple, revient \u00e0 trier les donn\u00e9es par nom de ville\u00a0:"
      ],
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "hp2oWY9Slo_h"
      }
    },
    {
      "source": [
        "cities.reindex([2, 0, 1])"
      ],
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "sN0zUzSAj-U1",
        "colab": {
          "autoexec": {
            "wait_interval": 0,
            "startup": false
          }
        }
      },
      "outputs": [],
      "execution_count": 0
    },
    {
      "source": [
        " La r\u00e9indexation est un excellent moyen de m\u00e9langer (organiser al\u00e9atoirement) les donn\u00e9es d'un `DataFrame`. Dans l'exemple ci-dessous, l'index de type tableau est transmis \u00e0 la fonction NumPy `random.permutation`, qui m\u00e9lange les valeurs. En appelant `reindex` avec ce tableau m\u00e9lang\u00e9, nous m\u00e9langeons \u00e9galement les lignes du `DataFrame`.\n",
        "Ex\u00e9cutez plusieurs fois la cellule suivante\u00a0!"
      ],
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "-GQFz8NZuS06"
      }
    },
    {
      "source": [
        "cities.reindex(np.random.permutation(cities.index))"
      ],
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "mF8GC0k8uYhz",
        "colab": {
          "autoexec": {
            "wait_interval": 0,
            "startup": false
          }
        }
      },
      "outputs": [],
      "execution_count": 0
    },
    {
      "source": [
        " Pour en savoir plus, consultez la [documentation relative aux index](http://pandas.pydata.org/pandas-docs/stable/indexing.html#index-objects)."
      ],
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "fSso35fQmGKb"
      }
    },
    {
      "source": [
        " ## Exercice n\u00b0\u00a02\n",
        "\n",
        "La m\u00e9thode `reindex` autorise les valeurs d'index autres que celles associ\u00e9es au `DataFrame` d'origine. Voyez ce qu'il se passe lorsque vous utilisez ce type de valeurs\u00a0! Pourquoi est-ce autoris\u00e9 \u00e0 votre avis\u00a0?"
      ],
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "8UngIdVhz8C0"
      }
    },
    {
      "source": [
        "# Your code here"
      ],
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "PN55GrDX0jzO",
        "colab": {
          "autoexec": {
            "wait_interval": 0,
            "startup": false
          }
        }
      },
      "outputs": [],
      "execution_count": 0
    },
    {
      "source": [
        " ### Solution\n",
        "\n",
        "Cliquez ci-dessous pour afficher la solution."
      ],
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "TJffr5_Jwqvd"
      }
    },
    {
      "source": [
        " Lorsque le tableau d'entr\u00e9e `reindex` contient des valeurs d'index ne faisant pas partie de la liste des index du `DataFrame` d'origine, `reindex` ajoute des lignes pour ces index \\'manquants\\' et ins\u00e8re la valeur `NaN` dans les colonnes correspondantes\u00a0:"
      ],
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "8oSvi2QWwuDH"
      }
    },
    {
      "source": [
        "cities.reindex([0, 4, 5, 2])"
      ],
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "yBdkucKCwy4x",
        "colab": {
          "autoexec": {
            "wait_interval": 0,
            "startup": false
          }
        }
      },
      "outputs": [],
      "execution_count": 0
    },
    {
      "source": [
        " Ce comportement est souhaitable, car les index sont souvent des cha\u00eenes de caract\u00e8res extraites des donn\u00e9es r\u00e9elles. La [documentation *Pandas* sur la fonction `reindex`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html) inclut un exemple avec des noms de navigateurs comme valeurs d'index).\n",
        "\n",
        "Dans ce cas, autoriser les index \\'manquants\\' facilite la r\u00e9indexation \u00e0 l'aide d'une liste externe, car vous n'avez pas \u00e0 vous pr\u00e9occuper de l'int\u00e9grit\u00e9 des donn\u00e9es d'entr\u00e9e."
      ],
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "2l82PhPbwz7g"
      }
    }
  ],
  "metadata": {
    "colab": {
      "default_view": {},
      "version": "0.3.2",
      "collapsed_sections": [
        "JndnmDMp66FL",
        "YHIWvc9Ms-Ll",
        "TJffr5_Jwqvd"
      ],
      "name": "intro_to_pandas.ipynb",
      "views": {}
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    }
  }
}
